Abstract

We propose an end-to-end, domain- 
independent neural encoder-aligner-decoder 
model for selective generation, i.e., the 
joint task of content selection and surface 
realization. Our model first encodes a full 
set of over-determined database event records 
via an LSTM-based recurrent neural network, 
then utilizes a novel coarse-to-fine aligner to 
identify the small subset of salient records to 
talk about, and finally employs a decoder to 
generate free-form descriptions of the aligned, 
selected records. Our model achieves the
best selection and generation results reported
to date (with 59% relative improvement in
generation) on the benchmark WeatherGov
dataset, despite using no specialized
features or linguistic resources. Using an
improved k-nearest neighbor beam filter
helps further. We also perform a series of
ablations and visualizations to elucidate the
contributions of our key model components.
Lastly, we evaluate the generalizability of
our model on the RoboCup dataset, and get
results that are competitive with or better than
the state-of-the-art, despite being severely
data-starved.

1 Introduction 

We consider the important task of producing a natural
language description of a rich world state represented
as an over-determined database of event records. This
task, which we refer to as selective generation, is
often formulated as two subproblems: content selection,
which involves choosing a subset of relevant records to
talk about from the exhaustive database, and surface
realization, which is concerned with generating natural
language descriptions for this subset. Learning to
perform these tasks jointly is challenging due to the
ambiguity in deciding which records are relevant, the
complex dependencies between selected records, and the
multiple ways in which these records can be described.

Previous work has made significant progress on this
task (Chen and Mooney, 2008; Angeli et al., 2010; Kim
and Mooney, 2010; Konstas and Lapata, 2012). However,
most approaches solve the content selection and surface
realization subtasks separately, use manual
domain-dependent resources (e.g., semantic parsers) and
features, or employ template-based generation. This
limits domain adaptability and reduces coherence. We
take an alternative, neural encoder-aligner-decoder
approach to free-form selective generation that jointly
performs content selection and surface realization,
without using any specialized features, resources, or
generation templates. This enables our approach to
generalize to new domains. Further, our memory-based
model captures the long-range contextual dependencies
among records and descriptions, which are integral to
this task (Angeli et al., 2010).

We formulate our model as an encoder-aligner-decoder
framework that uses recurrent neural networks with long
short-term memory units (LSTM-RNNs) (Hochreiter and
Schmidhuber, 1997) together with a coarse-to-fine
aligner to select and “translate” the rich world state
into a natural language description. Our model first
encodes the full set of over-determined event records
using a bidirectional LSTM-RNN. A novel coarse-to-fine
aligner then reasons over multiple abstractions of the
input to decide which of the records to discuss. The
model next employs an LSTM decoder to generate natural
language descriptions of the selected records.

The use of LSTMs, which have proven effective for
similar long-range generation tasks (Sutskever et al.,
2014; Vinyals et al., 2015b; Karpathy and Fei-Fei,
2015), allows our model to capture the long-range
contextual dependencies that exist in selective
generation. Further, the introduction of our proposed
variation on alignment-based LSTMs (Bahdanau et al.,
2014; Xu et al., 2015) enables our model to learn to
perform content selection and surface realization
jointly, by aligning each generated word to an event
record during decoding. Our novel coarse-to-fine
aligner avoids searching over the full set of
over-determined records by employing two stages of
increasing complexity: a pre-selector and a refiner
acting on multiple abstractions (low- and high-level)
of the record input. The end-to-end nature of our
framework has the advantage that it can be trained
directly on corpora of record sets paired with natural
language descriptions, without the need for
ground-truth content selection.

We evaluate our model on a benchmark weather
forecasting dataset (WeatherGov) and achieve the best
results reported to date on content selection (12%
relative improvement in F-1) and language generation
(59% relative improvement in BLEU), despite using no
domain-specific resources. We also perform a series of
ablations and visualizations to elucidate the
contributions of the primary model components, and
show improvements with a simple k-nearest neighbor
beam filter approach. Finally, we demonstrate the
generalizability of our model by directly applying it
to a benchmark sportscasting dataset (RoboCup), where
we get results competitive with or better than the
state-of-the-art, despite being extremely data-starved.

2 Related Work 

Selective generation is a relatively new research area
and more attention has been paid to the individual
content selection and surface realization subproblems.
With regards to the former, Barzilay and Lee (2004)
model the content structure from unannotated documents
and apply it to text summarization. Barzilay and Lapata
(2005) treat content selection as a collective
classification problem and simultaneously optimize the
local label assignment and their pairwise relations.
Liang et al. (2009) address the related task of
aligning a set of records to given textual description
clauses. They propose a generative semi-Markov
alignment model that jointly segments text sequences
into utterances and associates each to the
corresponding record.

Surface realization is often treated as a problem of
producing text according to a given grammar. Soricut
and Marcu (2006) propose a language generation system
that uses the WIDL-representation, a formalism used to
compactly represent probability distributions over
finite sets of strings. Wong and Mooney (2007) and Lu
and Ng (2011) use synchronous context-free grammars to
generate natural language sentences from formal meaning
representations. Similarly, Belz (2008) employs
probabilistic context-free grammars to perform surface
realization. Other effective approaches include the use
of tree conditional random fields (Lu et al., 2009) and
template extraction within a log-linear framework
(Angeli et al., 2010).

Recent work seeks to solve the full selective
generation problem through a single framework. Chen and
Mooney (2008) and Chen et al. (2010) learn alignments
between comments and their corresponding event records
using a translation model for parsing and generation.
Kim and Mooney (2010) implement a two-stage framework
that decides what to discuss using a combination of the
methods of Lu et al. (2008) and Liang et al. (2009),
and then produces the text based on the generation
system of Wong and Mooney (2007).

Angeli et al. (2010) propose a unified concept-to-text
model that treats joint content selection and surface
realization as a sequence of local decisions
represented by a log-linear model. Similar to other
work, they train their model using external alignments
from Liang et al. (2009). Generation then follows as
inference over this model, where they first choose an
event record, then the record’s fields (i.e.,
attributes), and finally a set of templates that they
then fill in with words for the selected fields. Their
ability to model long-range dependencies relies on
their choice of features for the log-linear model,
while the template-based generation further employs
some domain-specific features for fluent output.

Konstas and Lapata (2012) propose an alternative 
method that simultaneously optimizes the content 
selection and surface realization problems. They 
employ a probabilistic context-free grammar that 
specifies the structure of the event records, and then 
treat generation as finding the best derivation tree 
according to this grammar. However, their method 
still selects and orders records in a local fashion via 
a Markovized chaining of records. Konstas and Lapata
(2013) improve upon this approach with global
document representations. However, this approach 
also requires alignment during training, which they 
estimate using the method of Liang et al. (2009). 

We treat the problem of selective generation as
end-to-end learning via a recurrent neural network
encoder-aligner-decoder model, which enables us to
jointly learn content selection and surface realization
directly from database-text pairs, without the need for
an external aligner or ground-truth selection labels.
The use of LSTM-RNNs enables our model to capture the
long-range dependencies that exist among the records
and natural language output. Additionally, the model
does not rely on any manually-selected or
domain-dependent features, templates, or parsers, and
is thereby generalizable. The alignment-RNN approach
has recently proven successful for generation-style
tasks, e.g., machine translation (Bahdanau et al.,
2014) and image captioning (Xu et al., 2015). Since
selective generation requires identifying the small
number of salient records among an over-determined
database, we avoid performing exhaustive search over
the full record set, and instead propose a novel
coarse-to-fine aligner that divides the search
complexity into pre-selection and refinement stages.

3 Task Definition 

We consider the problem of generating a natural
language description for a rich world state specified
in terms of an over-determined set of records
(database). This problem requires deciding which of the
records to discuss (content selection) and how to
discuss them (surface realization). Training data
consists of scenario pairs $(r^{(i)}, x^{(i)})$ for
$i = 1, 2, \ldots, n$, where $r^{(i)}$ is the complete
set of records and $x^{(i)}$ is the natural language
description


r_{1:N}:
  temperature(time=17-06, min=48, mean=53, max=61)
  windSpeed(time=17-06, min=3, mean=6, max=11)
  windDir(time=17-06, mode=SSW)
  gust(time=17-06, min=0, mean=0, max=0)
  skyCover(time=17-21, mode=0-25)
  skyCover(time=02-06, mode=75-100)
  precipChance(time=17-06, min=2, mean=14, max=20)
  rainChance(time=17-06, mode=someChance)

x_{1:T}:
  “a 20 percent chance of showers after midnight,
  increasing clouds, with a low around 48 southwest
  wind between 5 and 10 mph”

(a) WeatherGov

r_{1:N}:
  pass(arg1=purple6, arg2=purple3)
  kick(arg1=purple3)
  badPass(arg1=purple3, arg2=pink9)
  turnover(arg1=purple3, arg2=pink9)

x_{1:T}:
  “purple3 made a bad pass that was picked off by pink9”

(b) RoboCup

Figure 1: Sample database-text pairs chosen from the
(a) WeatherGov and (b) RoboCup datasets.

(Fig. 1). At test time, only the records are given. We 
evaluate our model in the context of two publicly- 
available benchmark selective generation datasets. 

WeatherGov The weather forecasting dataset
(see Fig. 1(a)) of Liang et al. (2009) consists of
29528 scenarios, each with 36 weather records (e.g.,
temperature, sky cover, etc.) paired with a natural
language forecast (28.7 words on average).

RoboCup We evaluate our model’s generaliz- 
ability on the sportscasting dataset of Chen and 
Mooney (2008), which consists of only 1539 pairs 
of temporally ordered robot soccer events (e.g., pass, 
score) and commentary drawn from the four-game 
2001-2004 RoboCup finals (see Fig. 1(b)). Each 
scenario contains an average of 2.4 event records 
and a 5.7 word natural language commentary. 
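To make the input format concrete, the following is a minimal Python sketch of how one scenario (a record set paired with its description) might be represented. The class and field names (`Record`, `Scenario`, `record_type`, `fields`) are assumptions introduced here for illustration and are not part of either dataset's release format; the example values are the RoboCup pair from Fig. 1(b).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """One database event record: a record type plus named fields (hypothetical layout)."""
    record_type: str
    fields: Dict[str, str] = field(default_factory=dict)


@dataclass
class Scenario:
    """A training pair: the full (over-determined) record set and its description."""
    records: List[Record]
    description: List[str]  # tokenized natural language text


# The RoboCup example of Fig. 1(b): four records, only some of which are described.
scenario = Scenario(
    records=[
        Record("pass", {"arg1": "purple6", "arg2": "purple3"}),
        Record("kick", {"arg1": "purple3"}),
        Record("badPass", {"arg1": "purple3", "arg2": "pink9"}),
        Record("turnover", {"arg1": "purple3", "arg2": "pink9"}),
    ],
    description="purple3 made a bad pass that was picked off by pink9".split(),
)
```

At test time only `records` would be available, and the description is what the model has to produce.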

4 The Model 

We formulate selective generation as inference over a
probabilistic model $P(x_{1:T} \mid r_{1:N})$, where
$r_{1:N} = (r_1, r_2, \ldots, r_N)$ is the input set of
over-determined event records,¹
$x_{1:T} = (x_1, x_2, \ldots, x_T)$ is the generated
description with $x_t$ being the word at time $t$ and
$x_0$ being a special start token:

\begin{align}
x^*_{1:T} &= \arg\max_{x_{1:T}} P(x_{1:T} \mid r_{1:N}) \tag{1a} \\
          &= \arg\max_{x_{1:T}} \prod_{t=1}^{T} P(x_t \mid x_{0:t-1}, r_{1:N}) \tag{1b}
\end{align}

Figure 2: Our model architecture with a bidirectional
LSTM encoder, coarse-to-fine aligner, and decoder.

The goal of inference is to generate a natural language
description for a given set of records. An effective
means of learning to perform this generation is to use
an encoder-aligner-decoder architecture with a
recurrent neural network, which has proven effective
for related problems in machine translation (Bahdanau
et al., 2014) and image captioning (Xu et al., 2015).
We propose a variation on this general model with novel
components that are well-suited to the selective
generation problem.

Our model (Fig. 2) first encodes each input record
$r_j$ into a hidden state $h_j$ with
$j \in \{1, \ldots, N\}$ using a bidirectional
recurrent neural network (RNN). Our novel
coarse-to-fine aligner then acts on a concatenation
$m_j$ of each record and its hidden state as a
multi-level representation of the input to compute the
selection decision $z_t$ at each decoding step $t$. The
model then employs an RNN decoder to arrive at the word
likelihood $P(x_t \mid x_{0:t-1}, r_{1:N})$ as a
function of the multi-level input and the hidden state
of the decoder $s_{t-1}$ at time step $t-1$. In order
to model the long-range dependencies among the records
and descriptions (which is integral to effectively
performing selective generation (Angeli et al., 2010;
Konstas and Lapata, 2012; Konstas and Lapata, 2013)),
our model employs LSTM units as the nonlinear encoder
and decoder functions.

¹These records may take the form of an unordered set
or have a natural ordering (e.g., temporal in the case of
RoboCup). In order to make our model generalizable, we treat
the set as a sequence and use the order specified by the dataset.
We note that it is possible that a different ordering will yield
improved performance, since ordering has been shown to be
important when operating on sets (Vinyals et al., 2015a).

Encoder Our LSTM-RNN encoder (Fig. 2) takes as input
the set of event records represented as a sequence
$r_{1:N} = (r_1, r_2, \ldots, r_N)$ and returns a
sequence of hidden annotations
$h_{1:N} = (h_1, h_2, \ldots, h_N)$, where the
annotation $h_j$ summarizes the record $r_j$. This
results in a representation that models the
dependencies that exist among the records in the
database.

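We adopt an encoder architecture similar to that
of Graves et al. (2013):

\begin{align}
\begin{pmatrix} i^e_j \\ f^e_j \\ o^e_j \\ g^e_j \end{pmatrix}
  &= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
     T^e \begin{pmatrix} r_j \\ h_{j-1} \end{pmatrix} \tag{2a} \\
c^e_j &= f^e_j \odot c^e_{j-1} + i^e_j \odot g^e_j \tag{2b} \\
h_j &= o^e_j \odot \tanh(c^e_j) \tag{2c}
\end{align}
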

where $T^e$ is an affine transformation, $\sigma$ is
the logistic sigmoid that maps its input to $[0, 1]$,
$i^e_j$, $f^e_j$, and $o^e_j$ are the input, forget,
and output gates of the LSTM, respectively, and $c^e_j$
is the memory cell activation vector. The memory cell
$c^e_j$ summarizes the LSTM’s previous memory
$c^e_{j-1}$ and the current input, which are modulated
by the forget and input gates, respectively. Our
encoder operates bidirectionally, encoding the records
in both the forward and backward directions, which
provides a better summary of the input records. In this
way, the hidden annotations
$h_j = (\overrightarrow{h}_j^\top; \overleftarrow{h}_j^\top)^\top$
concatenate forward $\overrightarrow{h}_j$ and backward
$\overleftarrow{h}_j$ annotations, each determined
using Equation (2c).
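The recurrence in Eqn. 2 and the bidirectional concatenation can be sketched in plain Python as follows. This is an illustrative sketch only, not the Theano implementation: the helper names (`affine`, `lstm_step`, `encode_bidirectional`), the list-based vectors, and the convention that the affine map $T^e$ is stored as a weight matrix with four stacked blocks of rows plus a bias are all assumptions made here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vec):
    """Return weights @ vec + bias, with weights given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def lstm_step(r_j, h_prev, c_prev, weights, bias):
    """One encoder step of Eqn. 2; weights has 4*d rows for hidden size d."""
    d = len(h_prev)
    pre = affine(weights, bias, r_j + h_prev)          # T^e applied to (r_j; h_{j-1})
    i = [sigmoid(a) for a in pre[0:d]]                 # input gate
    f = [sigmoid(a) for a in pre[d:2 * d]]             # forget gate
    o = [sigmoid(a) for a in pre[2 * d:3 * d]]         # output gate
    g = [math.tanh(a) for a in pre[3 * d:4 * d]]       # candidate update
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]  # Eqn. 2b
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]                    # Eqn. 2c
    return h, c


def encode_bidirectional(records, d, fwd_params, bwd_params):
    """Run the LSTM in both directions and concatenate the annotations."""
    def run(seq, params):
        h, c, out = [0.0] * d, [0.0] * d, []
        for r in seq:
            h, c = lstm_step(r, h, c, *params)
            out.append(h)
        return out

    forward = run(records, fwd_params)
    backward = run(records[::-1], bwd_params)[::-1]    # re-align to record order
    return [hf + hb for hf, hb in zip(forward, backward)]
```

Each returned annotation has length `2 * d`, one half per direction, which matches the concatenated $h_j$ described above.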

Coarse-to-Fine Aligner Having encoded the input
records $r_{1:N}$ to arrive at the hidden annotations
$h_{1:N}$, the model then seeks to select the content
at each time step $t$ that will be used for generation.
Our model performs content selection using an extension
of the alignment mechanism proposed by Bahdanau et al.
(2014), which allows for selection and generation that
is independent of the ordering of the input.

In selective generation, the given set of event records
is over-determined with only a small subset of salient
records being relevant to the output natural language
description. Standard alignment mechanisms limit the
accuracy of selection and generation by scanning the
entire range of over-determined records. In order to
better address the selective generation task, we
propose a coarse-to-fine aligner that prevents the
model from being distracted by non-salient records. Our
model aligns based on multiple abstractions of the
input: both the original input record as well as the
hidden annotations $m_j = (r_j^\top; h_j^\top)^\top$,
an approach that has previously been shown to yield
better results than aligning based only on the hidden
state (Mei et al., 2015).

Our coarse-to-fine aligner avoids searching over the
full set of over-determined records by using two stages
of increasing complexity: a pre-selector and refiner
(Fig. 2). The pre-selector first assigns to each record
a probability $p_j$ of being selected, while the
standard aligner computes the alignment likelihood
$w_{tj}$ over all the records at each time step $t$
during decoding. Next, the refiner produces the final
selection decision by re-weighting the aligner weights
$w_{tj}$ with the pre-selector probabilities $p_j$:

\begin{align}
p_j &= \mathrm{sigmoid}\left(q^\top \tanh(P m_j)\right) \tag{3a} \\
\beta_{tj} &= v^\top \tanh(W s_{t-1} + U m_j) \tag{3b} \\
w_{tj} &= \exp(\beta_{tj}) \Big/ \sum_j \exp(\beta_{tj}) \tag{3c} \\
\alpha_{tj} &= p_j w_{tj} \Big/ \sum_j p_j w_{tj} \tag{3d} \\
z_t &= \sum_j \alpha_{tj} m_j \tag{3e}
\end{align}



where $P$, $q$, $U$, $W$, $v$ are learned parameters.
Ideally, the selection decision would be based on the
highest-value alignment $z_t = m_k$ where
$k = \arg\max_j \alpha_{tj}$. However, we use the
weighted average (Eqn. 3e) as its soft approximation to
maintain differentiability of the entire architecture.
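The five steps of Eqn. 3 can be traced in a short plain-Python sketch. It is meant only to illustrate the order of operations; the function name `coarse_to_fine`, the list-of-rows matrix layout, and the max-subtraction inside the softmax (a standard numerical-stability step that leaves $w_{tj}$ unchanged) are assumptions introduced here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def coarse_to_fine(m, s_prev, P, q, W, U, v):
    """Return (p, w, alpha, z_t) for multi-level inputs m and decoder state s_prev."""
    # (3a) pre-selector: one time-independent selection probability per record
    p = [1.0 / (1.0 + math.exp(-dot(q, [math.tanh(a) for a in matvec(P, m_j)])))
         for m_j in m]
    # (3b) standard aligner scores, conditioned on the previous decoder state
    Ws = matvec(W, s_prev)
    beta = [dot(v, [math.tanh(a + b) for a, b in zip(Ws, matvec(U, m_j))])
            for m_j in m]
    # (3c) softmax over records (max subtracted for numerical stability)
    top = max(beta)
    exps = [math.exp(b - top) for b in beta]
    total = sum(exps)
    w = [e / total for e in exps]
    # (3d) refiner: re-weight the aligner weights by the pre-selector and renormalize
    pw = [pj * wj for pj, wj in zip(p, w)]
    norm = sum(pw)
    alpha = [x / norm for x in pw]
    # (3e) context vector: weighted average of the multi-level inputs
    z = [sum(a * m_j[k] for a, m_j in zip(alpha, m)) for k in range(len(m[0]))]
    return p, w, alpha, z
```

With all parameters set to zero, every record receives $p_j = 0.5$ and a uniform $w_{tj}$, so $z_t$ reduces to the plain mean of the $m_j$; this is a convenient sanity check for the wiring.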


The pre-selector assigns large values ($p_j > 0.5$) to
a small subset of salient records and small values
($p_j < 0.5$) to the rest. This modulates the standard
aligner, which then has to assign a large weight
$w_{tj}$ in order to select the $j$-th record at time
$t$. In this way, the learned prior $p_j$ makes it
difficult for the alignment (attention) to be
distracted by non-salient records. Further, we can
relate the output of the pre-selector to the number of
records that are selected. Specifically, the output
$p_j$ expresses the extent to which the $j$-th record
should be selected. The summation $\sum_{j=1}^{N} p_j$
can then be regarded as a real-valued approximation to
the total number of pre-selected records (denoted as
$\gamma$), which we regularize towards, based on
validation (see Eqn. 5).

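Decoder Our architecture uses an LSTM decoder that
takes as input the current context vector $z_t$, the
last word $x_{t-1}$, and the LSTM’s previous hidden
state $s_{t-1}$. The decoder outputs the conditional
probability distribution
$p_{x,t} = P(x_t \mid x_{0:t-1}, r_{1:N})$ over the
next word, represented as a deep output layer (Pascanu
et al., 2014),

\begin{align}
\begin{pmatrix} i^d_t \\ f^d_t \\ o^d_t \\ g^d_t \end{pmatrix}
  &= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
     T^d \begin{pmatrix} E x_{t-1} \\ s_{t-1} \\ z_t \end{pmatrix} \tag{4a} \\
c^d_t &= f^d_t \odot c^d_{t-1} + i^d_t \odot g^d_t \tag{4b} \\
s_t &= o^d_t \odot \tanh(c^d_t) \tag{4c} \\
l_t &= L_0 (E x_{t-1} + L_s s_t + L_z z_t) \tag{4d} \\
p_{x,t} &= \mathrm{softmax}(l_t) \tag{4e}
\end{align}

where $E$ (an embedding matrix), $L_0$, $L_s$, and
$L_z$ are parameters to be learned.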

Training and Inference We train the model using the
database-text pairs $(r_{1:N}, x_{1:T})$ from the
training corpora so as to maximize the likelihood of
the ground-truth language description $x^*_{1:T}$
(Eqn. 1). Additionally, we introduce a regularization
term $(\sum_{j=1}^{N} p_j - \gamma)^2$ that enables the
model to influence the pre-selector weights based on
the aforementioned relationship between the output of
the pre-selector and the number of selected records.
Moreover, we also introduce the term
$(1.0 - \max(p_j))$, which accounts for the fact that
at least one record should be pre-selected. Note that
when $\gamma$ is equal to $N$, the pre-selector is
forced to select all the records ($p_j = 1.0$ for all
$j$), and the coarse-to-fine alignment reverts to the
standard alignment introduced by Bahdanau et al.
(2014). Together with the negative log-likelihood of
the ground-truth description $x^*_{1:T}$, our loss
function becomes

\begin{align}
L &= -\log P(x^*_{1:T} \mid r_{1:N}) + G \tag{5a} \\
  &= -\sum_{t=1}^{T} \log P(x^*_t \mid x_{0:t-1}, r_{1:N}) + G \tag{5b} \\
G &= \Big(\sum_{j=1}^{N} p_j - \gamma\Big)^2 + \big(1.0 - \max(p_j)\big) \tag{5c}
\end{align}

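As a small worked illustration of Eqn. 5, the regularizer and total loss can be computed as below. The function names (`selection_regularizer`, `total_loss`) and the convention of passing per-word log-probabilities as a list are assumptions made for this sketch.

```python
def selection_regularizer(p, gamma):
    """G of Eqn. 5c: pull sum(p) towards gamma and max(p) towards 1."""
    return (sum(p) - gamma) ** 2 + (1.0 - max(p))


def total_loss(word_log_probs, p, gamma):
    """Eqn. 5b: negative log-likelihood of the reference words plus G."""
    return -sum(word_log_probs) + selection_regularizer(p, gamma)


# With gamma equal to the number of records, G vanishes only when every p_j is 1.0,
# i.e. the pre-selector keeps all records and the aligner reverts to the standard one.
assert selection_regularizer([1.0, 1.0, 1.0], gamma=3) == 0.0
```

For instance, with two records at $p = (0.5, 0.5)$ and $\gamma = 1$, the first term is zero and the second contributes $0.5$, since no record is confidently pre-selected.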
Having trained the model, we generate the natural
language description by finding the maximum a
posteriori words under the learned model (Eqn. 1). For
inference, we perform greedy search starting with the
first word $x_1$. Beam search offers a way to perform
approximate joint inference; however, we empirically
found that beam search does not perform any better than
greedy search on the datasets that we consider, an
observation that is shared with previous work (Angeli
et al., 2010). We later discuss an alternative
k-nearest neighbor-based beam filter (see Sec. 6.2).
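Greedy search over the factorization in Eqn. 1b can be sketched as follows. The sketch assumes a callable `next_word_dist(prefix)` that returns the model's distribution over the next word as a dictionary, an explicit end-of-sequence token, and a length cap; none of these interface details are specified in the text, so they should be read as illustrative assumptions. The toy bigram table stands in for a trained model.

```python
def greedy_decode(next_word_dist, start_token, end_token, max_len):
    """Pick the most probable next word at each step until end_token or max_len."""
    prefix = [start_token]
    for _ in range(max_len):
        dist = next_word_dist(prefix)          # P(x_t | x_{0:t-1}, r_{1:N})
        word = max(dist, key=dist.get)         # greedy choice at step t
        if word == end_token:
            break
        prefix.append(word)
    return prefix[1:]                          # drop the start token x_0


# Toy stand-in for a trained model: the next word depends only on the last word.
toy_table = {
    "<s>": {"sunny": 0.6, "cloudy": 0.4},
    "sunny": {"today": 0.7, "</s>": 0.3},
    "today": {"</s>": 0.9, "sunny": 0.1},
}
words = greedy_decode(lambda prefix: toy_table[prefix[-1]], "<s>", "</s>", max_len=10)
```

Beam search would instead keep the several highest-scoring prefixes at each step rather than a single one.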

5 Experimental Setup 

Datasets We analyze our model on the benchmark
WeatherGov dataset, and use the data-starved RoboCup
dataset to demonstrate the model’s generalizability.
Following Angeli et al. (2010), we use WeatherGov
training, development, and test splits of size 25000,
1000, and 3528, respectively. For RoboCup, we follow
the evaluation methodology of previous work (Chen and
Mooney, 2008), performing four-fold cross-validation
whereby we train on three games (approximately 1000
scenarios) and test on the fourth. Within each split,
we hold out 10% of the training data as the development
set to tune the early-stopping criterion and $\gamma$.
We then report the standard average performance
(weighted by the number of scenarios) over these four
splits.
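The scenario-weighted average used to aggregate the splits amounts to the following one-line computation. The helper name `weighted_average` and the numbers in the usage line are made up purely for illustration and are not results from our experiments.

```python
def weighted_average(scores, sizes):
    """Average per-split scores, weighting each split by its number of scenarios."""
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)


# Hypothetical example: a split with 300 scenarios counts three times as much
# as a split with 100 scenarios.
example = weighted_average([80.0, 90.0], [100, 300])
```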

Training Details On WeatherGov, we lightly tune the
number of hidden units and $\gamma$ on the development
set according to the generation metric (BLEU), and
choose 500 units from {250, 500, 750} and
$\gamma = 8.5$ from {6.5, 7.5, 8.5, 10.5, 12.5}. For
RoboCup, we only tune $\gamma$ on the development set,
choosing $\gamma = 5.0$ from the set
{1.0, 2.0, ..., 6.0}, and do not retune the number of
hidden units. For each iteration, we randomly sample a
mini-batch of 100 scenarios during back-propagation and
use Adam (Kingma and Ba, 2015) for optimization.
Training typically converges within 30 epochs. We
select the model according to the BLEU score on the
development set.²

Evaluation Metrics We consider two metrics as a means
of evaluating the effectiveness of our model on the two
selective generation subproblems. For content
selection, we use the F-1 score of the set of selected
records as defined by the harmonic mean of precision
and recall with respect to the ground-truth selection
record set. We define the set of selected records as
consisting of the record with the largest selection
weight $\alpha_{tj}$ computed by our aligner at each
decoding step $t$.
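The selection metric described above can be sketched as follows: take the arg-max record at every decoding step, collect these into a set, and score that set against the ground-truth set. The function name `selection_f1` and the input layout (one list of weights $\alpha_{t\cdot}$ per decoding step, records identified by index) are assumptions for this illustration.

```python
def selection_f1(alignments, gold):
    """F-1 of the set of per-step arg-max records against the gold record set."""
    # One selected record per decoding step; repeated selections collapse in the set.
    selected = {max(range(len(alpha_t)), key=alpha_t.__getitem__)
                for alpha_t in alignments}
    gold = set(gold)
    true_positives = len(selected & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Three decoding steps over three records: steps select records 1, 0, and 1.
alignments = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
score = selection_f1(alignments, gold=[1, 2])
```

Here the selected set is {0, 1} and the gold set is {1, 2}, so precision and recall are both one half.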

We evaluate the quality of surface realization using
the BLEU score³ (a 4-gram matching-based precision)
(Papineni et al., 2001) of the generated description
with respect to the human-created reference. To be
comparable to previous results on WeatherGov, we also
consider a modified BLEU score (cBLEU) that does not
penalize numerical deviations of at most five (Angeli
et al., 2010) (i.e., to not penalize “low around 58”
compared to a reference “low around 60”). On RoboCup,
we also evaluate the BLEU score in the case that
ground-truth content selection is known (sBLEU_G), to
be comparable to previous work.
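The numeric tolerance behind cBLEU can be illustrated at the level of a single token comparison. This sketch shows only the matching rule stated above (numbers within five of the reference count as a match); it is not the evaluation script of Angeli et al. (2010), and the helper name `tokens_match` and its handling of non-negative integer tokens only are assumptions.

```python
def tokens_match(candidate, reference, tolerance=5):
    """Exact match for words; numeric tokens match if within the tolerance."""
    if candidate == reference:
        return True
    if candidate.isdigit() and reference.isdigit():
        return abs(int(candidate) - int(reference)) <= tolerance
    return False


# "low around 58" against the reference "low around 60": every token matches.
pairs = zip("low around 58".split(), "low around 60".split())
all_match = all(tokens_match(c, r) for c, r in pairs)
```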

6 Results and Analysis 

We analyze the effectiveness of our model on the
benchmark WeatherGov (as primary) and RoboCup (as
generalization) datasets. We also present several
ablations to illustrate the contributions of the
primary model components.


²We implement our model in Theano (Bergstra et al., 2010;
Bastien et al., 2012) and will make the code publicly available.

³We compute BLEU using the publicly available evaluation
provided by Angeli et al. (2010).

Table 1: Primary WeatherGov results

Method      F-1     sBLEU   cBLEU
KL12        -       33.70   -
KL13        -       36.54   -
ALK10       65.40   38.40   51.50
Our model   73.21   61.01   70.39

6.1 Primary Results (WeatherGov) 

We report the performance of content selection and
surface realization using F-1 and two BLEU scores
(standard sBLEU and the customized cBLEU of Angeli et
al. (2010)), respectively (Sec. 5). Table 1 compares
our test results against previous methods that include
KL12 (Konstas and Lapata, 2012), KL13 (Konstas and
Lapata, 2013), and ALK10 (Angeli et al., 2010). Our
method achieves the best results reported to date on
all three metrics, with relative improvements of 11.94%
(F-1), 58.88% (sBLEU), and 36.68% (cBLEU) over the
previous state-of-the-art.
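The relative improvements quoted above follow directly from the ALK10 and "Our model" rows of Table 1. The short check below reproduces the arithmetic; the helper name `relative_improvement` is introduced here for illustration.

```python
def relative_improvement(new, old):
    """Relative improvement of new over old, in percent."""
    return 100.0 * (new - old) / old


f1_gain = round(relative_improvement(73.21, 65.40), 2)      # F-1
sbleu_gain = round(relative_improvement(61.01, 38.40), 2)   # sBLEU
cbleu_gain = round(relative_improvement(70.39, 51.50), 2)   # cBLEU
```

The three values round to 11.94, 58.88, and 36.68, matching the percentages in the text.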

6.2 Beam Filter with k-Nearest Neighbors

We considered beam search as an alternative to greedy
search in our primary setup (Eqn. 1), but this performs
worse, similar to what previous work found on this
dataset (Angeli et al., 2010). As an alternative, we
consider a beam filter based on a k-nearest
neighborhood. See Supplementary Material for details.
Table 2 shows that this k-NN beam filter improves
results over the primary greedy results.


Table 2: k-NN beam filter (test set)

         Primary   k-NN Beam Filter
sBLEU    61.01     61.76
cBLEU    70.39     71.23

6.3 Ablation Analysis (WeatherGov) 

Next, we present several ablations to analyze the
contribution of our model components.⁴

Aligner Ablation First, we evaluate the contribution
of our proposed coarse-to-fine aligner by comparing our
model with the basic encoder-aligner-decoder model
introduced by Bahdanau et al. (2014). Table 3 reports
the results demonstrating that our aligner yields
superior F-1 and BLEU scores relative to a standard
aligner.

⁴These results are based on our primary model of Sec. 6.1
and on the development set.


Table 3: Coarse-to-fine aligner ablation (dev set)

Method           F-1     sBLEU   cBLEU
Basic            60.35   63.54   74.90
Coarse-to-fine   76.28   65.58   75.78

Encoder Ablation Next, we consider the effectiveness
of the encoder. Table 4 compares the results with and
without the encoder on the development set, and
demonstrates that there is a significant gain from
encoding the event records using the LSTM-RNN. We
attribute this improvement to the LSTM-RNN’s ability to
capture the relationships that exist among the records,
which is known to be essential to selective generation
(Barzilay and Lapata, 2005; Angeli et al., 2010).


Table 4: Encoder ablation (dev set)

Encoder   F-1     sBLEU   cBLEU
With      76.28   65.58   75.78
Without   57.45   56.47   68.63

6.4 Qualitative Analysis (WeatherGov) 

Output Examples Fig. 3 shows an example record set
with its output description and record-word alignment
heat map. As shown, our model learns to align records
with their corresponding words (e.g., windDir and
“southeast,” temperature and “71,” windSpeed and
“wind 10,” and gust and “winds could gust as high as
30 mph”). It also learns the subset of salient records
to talk about (matching the ground-truth description
perfectly for this example, i.e., a standard BLEU of
100.00). We also see some word-level mismatch, e.g.,
“cloudy” misaligns to id-0 temp and id-10 precipChance,
which we attribute to the high correlation between
these types of records (“garbage collection” in Liang
et al. (2009)).

Record details:
id-0: temperature(time=06-21, min=52, mean=63, max=71)
id-2: windSpeed(time=06-21, min=8, mean=17, max=23)
id-3: windDir(time=06-21, mode=SSE)
id-4: gust(time=06-21, min=0, mean=10, max=30)
id-5: skyCover(time=06-21, mode=50-75)
id-10: precipChance(time=06-21, min=19, mean=32, max=73)
id-15: thunderChance(time=13-21, mode=SChc)

Figure 3: An example generation for a set of records from WeatherGov.

Word Embeddings Training our decoder has the effect of
learning embeddings for the words in the training set
(via the embedding matrix E in Eqn. 4). Here, we
explore the extent to which these learned embeddings
capture semantic relationships among the training
words. Table 5 presents nearest neighbor words for some
of the common words from the WeatherGov dataset
(according to cosine similarity in the embedding
space). More details of other embedding approaches that
we tried are discussed in the Supplementary Material
section.

Table 5: Nearest neighbor word for example words

Word         Nearest neighbor
gusts        gust
clear        sunny
isolated     scattered
southeast    northeast
storms       winds
decreasing   falling

6.5 Out-of-Domain Results (RoboCup) 

We use the RoboCup dataset to evaluate the
domain-independence of our model. The dataset is
severely data-starved with only 1000 (approx.) training
pairs, which is much smaller than is typically
necessary to train RNNs. This results in higher
variance in the trained model distributions, and we
thus adopt the standard denoising method of ensembles
(Sutskever et al., 2014; Vinyals et al., 2015b; Zaremba
et al., 2014).⁵

⁵We use an ensemble of five randomly initialized models.


Table 6: RoboCup results

Method      F-1     sBLEU   sBLEU_G
CM08        72.00   -       28.70
LJK09       75.70   -       -
CKM10       79.30   -       -
ALK10       79.90   -       28.80
KL12        -       24.88   30.90
Our model   81.58   25.28   29.40

Following previous work, we perform two experiments on
the RoboCup dataset (Table 6), the first considering
full selective generation and the second assuming
ground-truth content selection at test time. On the
former, we obtain a standard BLEU score (sBLEU) of
25.28, which exceeds the best score of 24.88 (Konstas
and Lapata, 2012). Additionally, we achieve a selection
F-1 score of 81.58, which is also the best result
reported to date. In the case of assumed (known)
ground-truth content selection, our model attains an
sBLEU_G score of 29.40, which is competitive with the
state-of-the-art.⁶

7 Conclusion 

We presented an encoder-aligner-decoder model for
selective generation that does not use any specialized
features, linguistic resources, or generation
templates. Our model employs a bidirectional LSTM-RNN
model with a novel coarse-to-fine aligner that jointly
learns content selection and surface realization. We
evaluate our model on the benchmark WeatherGov dataset
and achieve state-of-the-art selection and generation
results. We achieve further improvements via a
k-nearest neighbor beam filter. We also present several
model ablations and visualizations to elucidate the
effects of the primary components of our model.
Moreover, our model generalizes to a different,
data-starved domain (RoboCup), where it achieves
results competitive with or better than the
state-of-the-art.

⁶The Chen and Mooney (2008) sBLEU_G result is from
Angeli et al. (2010).

A Supplementary Material

The following provides further evaluations of our 
model as a supplement to our original manuscript. 

A.1 Beam Filter with k-Nearest Neighbors

We perform greedy search as an approximation to full
inference over the set of decision variables (Eqn. 1).
We considered beam search as an alternative, but as
with previous work on this dataset (Angeli et al.,
2010), we found that greedy search still yields better
BLEU performance (Table 7).


Table 7: Effect of beam width

Beam width M   1       2       5       10
dev sBLEU      65.58   64.70   57.02   47.07
dev cBLEU      75.78   74.91   65.83   54.19
test sBLEU     61.01   60.15   53.70   44.32
test cBLEU     70.39   69.42   61.95   51.01


As an alternative, we consider a beam filter based on a
k-nearest neighborhood. First, we generate the M-best
description candidates (i.e., a beam width of M) for a
given input record set (database) using standard beam
search. Next, we find the K nearest neighbor
database-description pairs from the training data,
based on the cosine similarity of each neighbor
database with the given input record. We then compute
the BLEU score for each of the M description candidates
relative to the K nearest neighbor descriptions (as
references) and select the candidate with the highest
BLEU score. We tune K and M on the development set and
report the results in Table 8. Table 9 presents the
test results with this tuned setting (M = 2, K = 1),
where we achieve BLEU scores better than our primary
greedy results.
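The filtering procedure can be sketched in plain Python as follows. Two simplifications are made and should be read as assumptions: each database is represented by a fixed-length numeric vector (the text does not specify how record sets are vectorized for cosine similarity), and a simple clipped 1- and 2-gram precision (`overlap_score`) stands in for the BLEU score used in our experiments. All function names are introduced here for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def ngram_precision(candidate, references, n):
    """Clipped n-gram precision of candidate against a list of references."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    if not cand:
        return 0.0
    best = Counter()
    for ref in references:
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        for gram, count in ref_counts.items():
            best[gram] = max(best[gram], count)
    clipped = sum(min(count, best[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def overlap_score(candidate, references):
    """Stand-in for BLEU: mean of clipped unigram and bigram precision."""
    return (ngram_precision(candidate, references, 1)
            + ngram_precision(candidate, references, 2)) / 2.0


def knn_beam_filter(candidates, query_vec, train_pairs, k):
    """Pick the beam candidate that best matches the K nearest training descriptions."""
    neighbors = sorted(train_pairs,
                       key=lambda pair: cosine(query_vec, pair[0]),
                       reverse=True)[:k]
    references = [description for _, description in neighbors]
    return max(candidates, key=lambda cand: overlap_score(cand, references))


# Toy training data: (database vector, tokenized description) pairs.
train_pairs = [
    ([1.0, 0.0], "sunny with a high near 70".split()),
    ([0.0, 1.0], "rain likely after midnight".split()),
]
beam = ["rain likely".split(), "sunny with a high near 68".split()]
chosen = knn_beam_filter(beam, query_vec=[0.9, 0.1], train_pairs=train_pairs, k=1)
```

In this toy case the query is closest to the first training database, so the candidate that overlaps with its description is kept; in the setting above the candidates would come from a beam of width M and the references from the K nearest training pairs.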


Table 8: k-NN beam filter (dev set)

sBLEU    M = 2   M = 5   M = 10
K = 1    65.99   65.88   65.65
K = 2    65.89   65.98   65.83
K = 5    65.64   65.45   65.41
K = 10   65.91   65.89   65.12

cBLEU    M = 2   M = 5   M = 10
K = 1    76.21   76.13   75.98
K = 2    75.99   76.03   75.82
K = 5    75.90   75.63   75.41
K = 10   75.95   75.87   75.23


Table 9: k-NN beam filter (test set)

         Primary   k-NN (M = 2, K = 1)
sBLEU    61.01     61.76
cBLEU    70.39     71.23


A.2 Word Embeddings (Trained & Pretrained) 

Training our decoder has the effect of learning
embeddings for the words in the training set (via the
embedding matrix E in Eqn. 4). Here, we explore the
extent to which these learned embeddings capture
semantic relationships among the training words. Table
10 presents nearest neighbor words for some of the
common words from the WeatherGov dataset (according to
cosine similarity in the embedding space).
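Finding a word's nearest neighbor by cosine similarity can be sketched as below. The function names and the tiny two-dimensional vectors are made up for illustration only; they are not taken from the learned embedding matrix E.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def nearest_neighbor(word, embeddings):
    """Return the other vocabulary word whose embedding is closest in cosine similarity."""
    target = embeddings[word]
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine(target, embeddings[w]))


# Hypothetical toy embeddings (not learned values).
toy_embeddings = {
    "gusts": [1.0, 0.1],
    "gust": [0.9, 0.2],
    "sunny": [0.0, 1.0],
}
neighbor = nearest_neighbor("gusts", toy_embeddings)
```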

Table 10: Nearest neighbor word for example words

Word         Nearest neighbor
gusts        gust
clear        sunny
isolated     scattered
southeast    northeast
storms       winds
decreasing   falling

We also consider different ways of using pre-trained
word embeddings (Mikolov et al., 2013) to bootstrap the
quality of our learned embeddings. One approach
initializes our embedding matrix with the pre-trained
vectors and then refines the embedding based on our
training corpus. The second concatenates our learned
embedding matrix with the pre-trained vectors in an
effort to simultaneously exploit general similarities
as well as those learned for the domain. As shown
previously for other tasks (Vinyals et al., 2014;
Vinyals et al., 2015b), we find that the use of
pre-trained embeddings results in negligible
improvements (on the development set).